The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, ''Information technology – Universal Coded Character Set (UCS)'' (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added.
The UCS has over 1.1 million possible code points available for use/allocation, but only the first 65,536, which make up the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP.
The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms.
The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
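The pairing of high-half and low-half code values can be illustrated with a short Python sketch. The function names are hypothetical and chosen for this illustration only; the ranges used (0xD800 to 0xDBFF for high surrogates, 0xDC00 to 0xDFFF for low surrogates) are the ones reserved for this purpose in UTF-16.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

Because the two surrogate ranges do not overlap, a decoder can tell from a single 16-bit value whether it is looking at the first or second half of a pair.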
Another encoding, UTF-32 (previously named UCS-4), uses four bytes (32 bits in total) to encode a single character of the codespace. UTF-32 thereby permits a binary representation of every code point in APIs and software applications.
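As a small illustration of the fixed width, the following Python sketch encodes three characters with the standard library's big-endian UTF-32 codec and reads the code points back four bytes at a time. The sample string is an arbitrary choice for this example.

```python
text = "A\u20ac\U0001F600"            # U+0041, U+20AC, U+1F600
encoded = text.encode("utf-32-be")    # big-endian, no byte order mark

# Every character occupies exactly four bytes, so slicing is trivial.
units = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]
code_points = [int.from_bytes(unit, "big") for unit in units]

print(len(encoded))
print([hex(cp) for cp in code_points])
```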
History
The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That standard differed markedly from the current one. It defined:
* 128 groups of
* 256 planes of
* 256 rows of
* 256 cells,
for an apparent total of 2,147,483,648 characters, but actually the standard could code only 679,477,248 characters, as the policy forbade byte values of C0 and C1 control codes (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any one of the four bytes specifying a group, plane, row and cell. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
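The two totals can be checked with a few lines of Python. This is only an arithmetic sketch of the figures quoted above; the helper name `allowed` is hypothetical.

```python
def allowed(byte_values):
    """Drop byte values that fall in the C0 (0x00-0x1F) or C1 (0x80-0x9F) ranges."""
    return [b for b in byte_values
            if not (0x00 <= b <= 0x1F or 0x80 <= b <= 0x9F)]


# Apparent capacity: 128 groups x 256 planes x 256 rows x 256 cells.
apparent_total = 128 * 256 * 256 * 256

# Usable capacity once control-code byte values are excluded from every byte.
group_bytes = allowed(range(128))    # the group byte only spans 0x00-0x7F
other_bytes = allowed(range(256))    # plane, row and cell bytes span 0x00-0xFF
codable_total = len(group_bytes) * len(other_bytes) ** 3

# Capacity of a single plane (rows x cells) under the same restriction.
per_plane = len(other_bytes) ** 2

print(apparent_total, codable_total, per_plane)
```

The restriction leaves 96 usable values for the group byte and 192 for each of the other three bytes, which is where the smaller total comes from.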
One could code the characters of this primordial ISO/IEC 10646 standard in one of three ways:
# UCS-4, four bytes for every character, enabling the simple encoding of all characters;
# UCS-2, two bytes for every character, enabling the encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 codepoints, straightforwardly, and other planes and groups by switching to them with ISO/IEC 2022 escape sequences;
# UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, each of which contains no control codes).
In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO/IEC 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. ISO officials realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (prohibition of control code values), thus opening code points for allocation; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.
Meanwhile, over time, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and the standard from version 2.0 onwards supports encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO/IEC 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO/IEC 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32, although it has almost no use outside programs' internal data.
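The figure of 1,112,064 follows from the design of UTF-16, as the following arithmetic sketch in Python suggests. The variable names are illustrative only.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000            # 65,536
SURROGATES = 0xDFFF - 0xD800 + 1           # 2,048 code points reserved for pairing

# View 1: the whole codespace minus the surrogate code points.
codespace = PLANES * CODE_POINTS_PER_PLANE
encodable = codespace - SURROGATES

# View 2: what UTF-16 can express directly.
single_units = CODE_POINTS_PER_PLANE - SURROGATES   # BMP values that stand alone
pairs = 1024 * 1024                                 # high surrogate x low surrogate

print(codespace, encodable, single_units + pairs)
```

Both views give the same count: 1,024 high surrogates combined with 1,024 low surrogates address exactly the 16 planes beyond the BMP.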
Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8, currently the most popular UCS encoding.
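The mixed-width nature of UTF-8 can be observed with Python's built-in codec. The four sample characters below are arbitrary picks, one for each possible encoded length.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # U+0041, U+00E9, U+20AC, U+1F600

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

# ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = "Plan 9".encode("utf-8") == "Plan 9".encode("ascii")
print(ascii_compatible)
```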
Differences from Unicode
ISO/IEC 10646 and Unicode have an identical repertoire and numbers: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. Unicode has rules and specifications outside the scope of ISO/IEC 10646. ISO/IEC 10646 is a simple character map, an extension of previous standards like ISO/IEC 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO/IEC 10646; Unicode must be implemented.
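Normalisation is one place where a plain character map is not enough. The sketch below uses Python's standard `unicodedata` module to show that two different code point sequences for the same visible letter only compare equal after being normalised; the choice of the letter é is just for illustration.

```python
import unicodedata

composed = "\u00e9"        # é as a single precomposed code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ.
print(composed == decomposed)

# Normalisation Form C composes, Normalisation Form D decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```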
To support these rules and algorithms, Unicode adds many properties to each character in the set, such as properties determining a character's default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
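Some of these properties can be inspected through Python's standard `unicodedata` module, as in the sketch below. The helper `describe` is a hypothetical name introduced for this example, and the property values come from whichever Unicode database version ships with the Python interpreter in use.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode character properties for a single character."""
    return {
        "name": unicodedata.name(ch),
        "bidi_class": unicodedata.bidirectional(ch),   # default bidirectional class
        "combining_class": unicodedata.combining(ch),  # 0 means non-combining
        "numeric_value": unicodedata.numeric(ch, None),
    }


# Latin A, digit 8, vulgar fraction one quarter, Hebrew alef, combining acute.
for ch in ["A", "8", "\u00bc", "\u05d0", "\u0301"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```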
Some applications support ISO/IEC 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO/IEC 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.
Citing the Universal Coded Character Set
''ISO/IEC 10646'', a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. Although it is a separate standard, the term ''Unicode'' is used just as often, informally, when discussing the UCS. However, any normative reference to the UCS as a publication should cite the year of the edition in the form ''ISO/IEC 10646:YYYY'', for example: ''ISO/IEC 10646:2014''.
Relationship with Unicode
Since 1991, the Unicode Consortium and the ISO/IEC have developed ''The Unicode Standard'' ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.
* ISO/IEC 10646-1:1993 = Unicode 1.1
* ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.0
* ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.1 excluding Euro Sign and Object Replacement Character, which are included in Amendment 18
* ISO/IEC 10646-1:2000 = Unicode 3.0
* ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001 = Unicode 3.1
* ISO/IEC 10646-1:2000 plus Amendment 1 and ISO/IEC 10646-2:2001 = Unicode 3.2
* ISO/IEC 10646:2003 = Unicode 4.0
* ISO/IEC 10646:2003 plus Amendment 1 = Unicode 4.1
* ISO/IEC 10646:2003 plus Amendments 1 to 2 = Unicode 5.0 excluding Devanagari Letters GGA, JJA, DDDA and BBA, which are included in Amendment 3
* ISO/IEC 10646:2003 plus Amendments 1 to 4 = Unicode 5.1
* ISO/IEC 10646:2003 plus Amendments 1 to 6 = Unicode 5.2
* ISO/IEC 10646:2003 plus Amendments 1 to 8 = ISO/IEC 10646:2011 = Unicode 6.0 excluding Indian Rupee Sign
* ISO/IEC 10646:2012 = Unicode 6.1
* ISO/IEC 10646:2012 = Unicode 6.2 excluding Turkish Lira Sign, which is included in Amendment 1
* ISO/IEC 10646:2012 = Unicode 6.3 excluding Turkish Lira Sign, which is included in Amendment 1, and five bidirectional control characters (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate), which are included in Amendment 2
* ISO/IEC 10646:2012 plus Amendments 1 and 2 = Unicode 7.0 excluding the Ruble sign
* ISO/IEC 10646:2014 plus Amendment 1 = Unicode 8.0 excluding the Lari sign, nine CJK unified ideographs, and 41 emoji characters
* ISO/IEC 10646:2014 plus Amendments 1 and 2 = Unicode 9.0 excluding Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols
* ISO/IEC 10646:2017 = Unicode 10.0 excluding 285 Hentaigana characters, 3 Zanabazar Square characters, and 56 emoji symbols
* ISO/IEC 10646:2017 plus Amendment 1 = Unicode 11.0 excluding 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters
* ISO/IEC 10646:2017 plus Amendments 1 and 2 = Unicode 12.0 excluding 62 additional characters
* ISO/IEC 10646:2020 = Unicode 13.0
* ISO/IEC 10646:2021 = Unicode 14.0
See also
Related standards:
** ISO/IEC 646 (positions 0 to 127 are the same as in ISO/IEC 10646 and Unicode, and the numbers 646 and 10646 are similar)
** ISO/IEC 2022 ''Information technology – Character code structure and extension techniques''
** ISO/IEC 6429 ''C0 and C1 control codes''
** ISO/IEC 8859 (positions 0 through 255 of UCS and Unicode are the same as in ISO/IEC 8859-1, alias ISO Latin 1)
** ISO/IEC 14651 ''Information technology – International string ordering and comparison''
** ISO 15924 ''Codes for the representation of names of scripts'' (each character is associated with one of those scripts)
* Comparison of Unicode encodings
* List of XML and HTML character entity references
* List of Unicode fonts
* Universal Character Set characters
* ISO/IEC JTC 1/SC 2
External links
* Publicly available standards (ISO) – includes a copy of ISO 10646:2014 (129 MB ZIP file, released 2014-09-01) and electronic inserts (1.7 MB ZIP file)
* ISO/IEC JTC1/SC2/WG2, the working group in charge of ISO 10646
* UTF-8 and Unicode FAQ
* SIL's freeware fonts, editors and documentation
* ttp://archive.adaic.com/pol-hist/history/9x-history/reports/charset-Oct89.txt Character set issues for ADA 9x, from October 1989, goes into some detail about the original, pre-merger DIS ISO-10646